Selection, Mutations and Codon Usage in a Bacterial Model



Franco Bagnoli*
Dipartimento di Matematica Applicata,
Università di Firenze,
via S. Maria 3,
I-50139 Firenze, Italy.

Pietro Liò
Dipartimento di Biologia Animale e Genetica,
Università di Firenze,
Via Romana 17,
I-50125 Firenze, Italy

Abstract 

We present a statistical model of bacterial evolution based on the coupling between
codon usage and tRNA abundance. The model interprets this aspect of the evolutionary
process as a balance between the codon homogenization effect due to the mutation
process and the improvement of the translation phase due to natural selection. We
develop a thermodynamical description of the asymptotic state of the model. The
analysis of naturally occurring sequences shows that natural selection on codon bias
affects not only genes whose products are required in large amounts at maximal growth
rate, but also genes whose products undergo rapid transient increases.

appeared as J. theor. Biol. 173, 217 (1995). 

I. INTRODUCTION 

The evolution of the genetic material is determined by the interactions among mutations, 
random drift and natural selection. 

The mutation rate seems to vary both among and within genomes, being affected by
many factors, such as the chromosomal position (Sharp et al., 1989), the G+C content
(Wolfe, 1991), the nearest neighbor bases (Blake et al., 1992), and the different efficiency
of the repair systems between the lagging and the leading DNA strands during replication
and transcription (Veaute & Fuchs, 1993). Thus the molecular clock seems to tick at
different rates for different DNA positions.

Natural selection acts as a driving force at virtually all levels of genetic information
processing and biological organization: from DNA stability, replication and
transcription to messenger RNA life span and translation efficiency, to the correct
functioning of the gene products in the building up and propagation of a living organism.
Although in principle all these constraints could interact in a very complex way, it is
nevertheless fruitful to try to untangle the role of each element.

We have focused on the translation process as a major source of fitness for the bacterial 
cell. 

Because of the degeneracy of the genetic code and the historical process of codon capture
(Osawa & Jukes, 1989), different codons (synonymous codons) specify the same amino
acid. The bias in the codon usage is generally not random and it is very marked in some
procaryotic and eucaryotic genes. The codon choice is species-specific (Grantham et al.,
1980), being roughly the same, although with different intensities, in all the genes within a
genome. The codon bias is thought to be an effect of the improvement of the efficiency of
the translation process under natural selection.

Several authors have stressed that the synonymous codons preferentially used in highly
expressed genes of E. coli are translated by the most abundant isoacceptor tRNA species,
and that the levels of tRNAs in bacterial cells depend on the amino acid usage in proteins
(Ikemura, 1981a).

A good correlation between codon usage and tRNA population has also been found in
Mycoplasma capricolum (Yamao et al., 1991), in yeast (Ikemura, 1981b) (Ikemura, 1985)
and in chloroplast genomes (Morton, 1993).

The importance of codon usage as a substrate for natural selection is well illustrated by the
coding strategy of the coliphage T4: the early expressed genes of T4 have a codon pattern
close to that of E. coli, while the late expressed genes show a preference towards the codons
read by the tRNAs expressed by the genome of the phage itself (Cowe & Sharp, 1991).

The codon usage influences the elongation rate in the translation process, since the
pairing between the codon and the anticodon of the diffusing tRNA is the rate-limiting step
compared with the peptide bond formation and translocation steps (Gouy & Grantham, 1980);
however, it cannot be directly related to the protein expression rate. In fact many
experiments have demonstrated that other crucial steps affecting the protein production
rate are the selection of mRNA by a ribosome, the initiation phase and the phenomenon of
polypeptidyl-tRNA drop-off (Menninger, 1978).

Moreover, the dynamics of the elongation phase seems to be more complex, depending also on
the differences in the energy of codon-anticodon pairing (Grosjean & Fiers, 1982) (Thomas
et al., 1988); thus differences in elongation rates have been found for codons read by the
same tRNA species (Sorensen & Pedersen, 1991).

Different considerations apply to the evolutionary role of the minor and of the major
codons. Some authors have suggested that the selection of some of the rarer tRNA
species may be the rate-limiting step in protein synthesis (Varenne et al., 1984): they
hypothesize that the rate of polypeptide elongation could act as a regulatory mechanism in
gene expression when clusters of rare codons undergo translation, i.e. when the required
tRNA concentration is low and rate-limiting. This hypothesis has been confirmed
experimentally by manipulating the concentrations of some tRNAs and creating rate-limiting
conditions. Varenne (Goldman, 1982) (Varenne et al., 1984) demonstrated experimentally
that the presence of rare codons in highly expressed coding regions is associated with pauses
in the synthesis of proteins. Many different contexts have been tested with similar results:
Hoekema (Hoekema et al., 1987) replaced some major codons present in the first region of
the PGK1 gene of Saccharomyces cerevisiae with synonymous minor codons; Kinnaird
(Kinnaird et al., 1991) generated a cluster of three rare codons in the GDH gene in Neurospora
crassa. Chen and Inouye (Chen & Inouye, 1990) have demonstrated that the introduction
of a cluster of a rare codon, AGG, in the lacZ sequence of E. coli lowers the expression of the
gene; moreover this effect is inversely correlated with the distance between the initiation site
and the position of the rare codons.

Models that take into account the influence of minor codons in the initiation region of
mRNA for protein production rate have been presented (Liljenstrom & von Heijne, 1987) 
(Liljenstrom & Bomberg, 1987). 

On the contrary Sharp and Li (Sharp & Li, 1986) have hypothesized that the pattern 
of synonymous codon usage in regulatory genes reflects primarily the relaxation of natural 
selection. In fact, the absence of rare codons from highly expressed genes may well result 
from negative selection, while in low expressed genes, selection against rare codons is very 
weak and so they can accumulate under the pressure of mutation. 

Remarkably different considerations should be drawn for the major codons, i.e. codons 
read by the most abundant tRNAs. There is evidence that the major proteins are translated 
faster than other proteins and that the elongation rate at major codons is faster than that 
at other codons (Emilsson & Kurland, 1990a). 

Some authors (Emilsson & Kurland, 1990a) (Emilsson & Kurland, 1990b) have hypothesized
that growth rate, in bacteria, depends on tRNA abundance. Such a relation results in
the so-called growth maximization strategy and it is consistent with a major codon
preference strategy, in which an optimal subset of codons is thought to be used very
frequently in highly expressed genes, and their cognate tRNA species are supposed to
increase considerably at high growth rates. This is in agreement with the fact that the genes
for the major tRNA species are located inside rRNA operons for a coordinate expression.

The growth rate dependence on tRNA abundances and other aspects of the evolution
of genes and genomes in bacteria can be addressed with statistical mechanics tools. Let us
consider a homogeneous population of bacterial genomes, subjected only to synonymous
mutations and natural selection. If we study its evolution in the proximity of a steady
dynamical state, for instance a balanced growth state, we can interpret the interaction
between mutations and natural selection as a competition between randomness and order.
In this spirit we develop a statistical model from which bacterial evolution emerges as an
optimization process under environmental constraints.

II. THE MODEL 

In order to study quantitatively the competition between mutation and natural selection
in naturally occurring sequences, we concentrate on the process of the elongation phase.
We suppose that the rate-limiting step in this phase is the relative abundance of charged
tRNAs. Thus we assume that the time required by a ribosome to process a codon is inversely
proportional to the abundance of the charged cognate tRNAs in its vicinity. According to
the data from Gouy and Grantham (Gouy & Grantham, 1980), the process of
tRNA diffusion in cell extract is much faster than all other processes involved in our
schematization, and thus we do not consider the effects of spatial gradients.

Furthermore, we assume that the efficiency of aminoacyl-tRNA synthetases is not rate
limiting, i.e. that the concentration of the charged tRNAs does not vary with the
translation process regardless of the codon composition of mRNAs. This approximation is not
completely fulfilled in real bacteria. From the same source above (Gouy & Grantham, 1980)
we obtain that on average only 85% of the tRNA pool is acylated, indicating that the reaction
rate of synthetase is not substrate limited.

In this approximation we do not take into account the finite size of a ribosome, nor the 
queuing of ribosomes, thus disregarding the positional effects of codon usage, as considered 
elsewhere (Liljenstrom & Bomberg, 1987). 

With this assumption, the translation time does not depend on the order of codons in 
mRNA and can be calculated as the sum of the time required for each codon. Finally, we 
do not consider the influence of fluctuations, i.e. we develop a sort of "mean field" model 
for the translation. 

We indicate the portion of mRNA that codes for a protein (or, alternatively, the whole
mRNA) with a vector $c$, $c_i$ ($i = 1, \dots, 61$) being the number of codons of type $i$ and
$L = \sum_i c_i$ the length of the coding region. We denote with $a_k$ the relative abundance of
acylated tRNA of type $k$ ($k = 1, \dots, n_{tRNA}$; $\sum_{k=1}^{n_{tRNA}} a_k = 1$). For E. coli
$n_{tRNA} = 45$, see also the following section. The "translational efficiency" of codons by
tRNAs is represented by means of a matrix $R$ of size $61 \times n_{tRNA}$. Again from
(Gouy & Grantham, 1980), the transpeptidation and translocation phases do not weigh heavily
on the tRNA cycle and, disregarding them, the mean time required to translate the codon $i$
is inversely proportional to $p_i = \sum_{k=1}^{n_{tRNA}} R_{ik} a_k$. The mean translation time
per codon $\tau$ (in arbitrary time units) is thus given by

$$\tau = \frac{1}{L} \sum_{i=1}^{61} \frac{c_i}{p_i}. \qquad (1)$$

For simplicity, the elements $R_{ik}$ can be taken to be either 0, for a tRNA $k$ that does not
pair with a codon $i$, or 1, disregarding the differences in energies in the codon-anticodon
pairing discussed in the previous section.
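As a concrete illustration of formula (1), the following minimal Python sketch evaluates $\tau$ for a toy genetic code. The three codon types, the two tRNA species, the pairing matrix and the abundances are made-up numbers chosen for readability, not E. coli data, and the function name is hypothetical.

```python
def mean_translation_time(codon_counts, pairing, abundances):
    """Mean translation time per codon, tau = (1/L) * sum_i c_i / p_i,
    with p_i = sum_k R_ik * a_k (arbitrary time units)."""
    length = sum(codon_counts)              # L, length of the coding region
    total_time = 0.0
    for c_i, row in zip(codon_counts, pairing):
        if c_i == 0:
            continue                        # unused codons contribute nothing
        p_i = sum(r_ik * a_k for r_ik, a_k in zip(row, abundances))
        total_time += c_i / p_i
    return total_time / length


# Toy example (assumed values): 3 codon types, 2 tRNA species.
counts = [6, 3, 1]                          # c_i
R = [[1, 0],                                # codon 0 is read by tRNA 0 only
     [0, 1],                                # codon 1 is read by tRNA 1 only
     [1, 1]]                                # codon 2 is read by both tRNAs
a = [0.75, 0.25]                            # relative abundances, sum to 1

tau = mean_translation_time(counts, R, a)
print(tau)
```

In this toy case the codon read by the rare tRNA dominates the total time even though it is used half as often as the major codon, which is the effect that selection is assumed to act upon.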

In bacteria, cell division and DNA replication are coupled with cell growth (Zyskind & 
Smith, 1992) (Ageno, 1992). In balanced growth conditions the average number of genomes, 
the average number of cells and the cell mass increase exponentially with the same rate. 

The cell growth largely depends on the production rate of the most abundant biopolymers,
among which ribosomal proteins constitute the largest fraction. The time $\tau$ in formula
(1), calculated for an "average" ribosomal protein, can be considered proportional to the
mean duplication time. We assume that natural selection tends to lower this time. A
somewhat opposite effect is due to the mutation process, which tends to randomize the codon
sequence. Since silent mutations do not change the coded amino acid, and thus do not
affect the composition of proteins, they are neutral for the selection over the protein
functionality. However, silent mutations are not neutral for the calculation of the
duplication time $\tau$, because of the presence of the abundance of cognate tRNAs in
formula (1).

Let us start from a simplified situation, in which two synonymous codons (say codon 0
and 1) are only read by two tRNAs (with abundances $a_0$ and $a_1$, respectively), while all
other codons are grouped together (say codon 2) and on average are read by tRNAs with
abundance $a_2$ ($a_0 + a_1 + a_2 = 1$). The number of synonymous codons is $c_0 + c_1 = l$. We
consider only mutations from 0 to 1 and vice versa. Since $\tau$ does not depend on the
order of the codons in the string, the various strains can be grouped together according to
the number $j$ of ones in the mRNA ($j = c_1$). For each group $j$ there are
$g_j = \binom{l}{j}$ strains.

We rewrite the formula (1) for the time $\tau_j$ required to duplicate a bacterium belonging
to group $j$ as

$$\tau_j = \frac{j}{a_1} + \frac{l-j}{a_0} + \frac{L-l}{a_2} = rj + q + z; \qquad (2)$$

$$r = \frac{a_0 - a_1}{a_0 a_1}; \quad q = \frac{l}{a_0}; \quad z = \frac{Kl}{a_2}; \quad K = \frac{L-l}{l}.$$

The quantity $K^{-1}$ is the relative abundance of the synonymous codons 0 and 1 in the
coding region. For $l \ll L$, $K \to \infty$; for $l \simeq L$, $K \simeq 0$. Since selection only
acts on 0 and 1 codons, $K^{-1}$ gives an indication of the influence of selection on the gene.

In the unit time interval, the number of duplications $\nu_j$ of the mass $M_j$ of bacteria
belonging to group $j$ is $\nu_j = \tau_j^{-1}$, and thus

$$\frac{dM_j}{dt} = \nu_j M_j, \qquad (3)$$

where for simplicity we set all constants to 1 and we neglect the time dependence of $M_j$.

In the limit of very low mutation rate, we can assume that for each generation there is
at most one synonymous mutation, that changes codon 1 to 0 or vice versa. Let us
indicate with $\mu$ the rate of mutation per codon ($\mu \ll 1$). The average fraction of bacteria
in group $j$ that undergo a mutation per unit of time is $\mu\nu_j$. For bacteria in group $j$, a
synonymous mutation changes $j$ by one unit. Including mutations, and working with the
mass $m_j = M_j/g_j$ of a single strain in group $j$, formula (3) becomes

$$\frac{dm_j}{dt} = (1-\mu)\nu_j m_j + \frac{j}{l}\mu\nu_{j-1}m_{j-1} + \frac{l-j}{l}\mu\nu_{j+1}m_{j+1}; \qquad (4)$$

for $0 \le j \le l$, assuming that $\nu_{-1} = \nu_{l+1} = 0$.

The total mass of the bacterial population is $M = \sum_{k=0}^{l} g_k m_k$. We can derive from
eq. (4) the evolution equation for the probability distribution $p_j = m_j/M$ of different
strains in the total population. In a natural environment the exponential growth periods
are sporadic, generally followed by starvation phases. We mimic this alternation by means
of the normalization of the distribution, assuming that the strains corresponding to a very
low probability are those eliminated by natural selection.

Since mutations do not change the total mass of the population, we have, summing up over
$j$ in eq. (4),

$$\frac{dM}{dt} = \sum_{k=0}^{l} g_k\nu_k m_k = M\sum_{k=0}^{l} g_k\nu_k p_k = M\bar\nu; \qquad (5)$$

obtaining

$$\frac{dp_j}{dt} = \frac{d}{dt}\frac{m_j}{M} = \frac{1}{M}\frac{dm_j}{dt} - \frac{m_j}{M^2}\frac{dM}{dt},$$

and thus

$$\frac{dp_j}{dt} = \bigl((1-\mu)\nu_j - \bar\nu\bigr)p_j + \frac{j}{l}\mu\nu_{j-1}p_{j-1} + \frac{l-j}{l}\mu\nu_{j+1}p_{j+1}, \qquad (6)$$

$$\bar\nu = \sum_{k=0}^{l} g_k\nu_k p_k.$$

Before dealing with these equations from a mathematical point of view, let us make some
thermodynamical considerations. The scenario is reminiscent of statistical mechanics
systems, in which there is competition between order (the energy function to be minimized,
related to the average duplication time) and the entropy, mediated by the temperature.

Each strain in group $j$ contributes with $\tau_j$ to the mean duplication time. Taking
into account the analogy between the duplication time of strains and the energy levels
of a statistical system, in the equilibrium (stationary) state we can tentatively apply
the methods from equilibrium statistical mechanics. The maximization of the "entropy"
$S = -\sum_{k=0}^{l} g_k p_k \ln p_k$ under the constraints $\sum_{k=0}^{l} g_k p_k = 1$
(normalization of the probability distribution) and $\sum_{k=0}^{l} g_k p_k \tau_k = \mathrm{const}$
(in order to select the most probable distribution among those with the same fitness)
gives the Boltzmann distribution (Landau & Lifshitz, 1958)

$$p_j = C \exp(-\beta\tau_j).$$
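For readers who want the intermediate step, the constrained maximization can be sketched with Lagrange multipliers. The multiplier $\lambda$ and the constant $E$ (the fixed value of the second constraint) are our own notation; this is the standard textbook argument, stated here as a sketch rather than as a passage of the original derivation.

```latex
\mathcal{L} = -\sum_{k=0}^{l} g_k p_k \ln p_k
  - \lambda \Bigl(\sum_{k=0}^{l} g_k p_k - 1\Bigr)
  - \beta \Bigl(\sum_{k=0}^{l} g_k p_k \tau_k - E\Bigr),
\qquad
\frac{\partial \mathcal{L}}{\partial p_j}
  = -g_j \bigl(\ln p_j + 1 + \lambda + \beta \tau_j\bigr) = 0
\;\Longrightarrow\;
p_j = e^{-(1+\lambda)}\, e^{-\beta \tau_j} = C \exp(-\beta \tau_j).
```

The multiplicity $g_j$ factors out of the stationarity condition, so the per-strain probability $p_j$ depends on $j$ only through $\tau_j$; the constant $C$ is then fixed by the normalization constraint.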

The Lagrange multiplier $\beta$ can be considered as the inverse of an effective "temperature"
$T$, that, intuitively, should be related to the mutation rate $\mu$. In the stationary state,
$\bar\nu$ is constant in time, and from eq. (5), $M$ (and thus $m_j$) grows exponentially.
Inserting the Ansatz

$$m_j = M_0 \exp(\alpha t - \beta\tau_j)$$

($M_0$ is the total mass at time $t = 0$) in the equation (4), and using the relations (2), we
get for the asymptotic state

$$(1-\mu)\nu_j - \alpha + \frac{l-j}{l}\mu\nu_{j+1}x + \frac{j}{l}\mu\nu_{j-1}x^{-1} = 0, \qquad (7)$$

where $x = \exp(-\beta r)$. Approximating $\nu_{j+1} \simeq \nu_{j-1} \simeq \nu_j$, with an error of
order $r/\tau_j$, which is small for $l$ or $L$ large or $a_1 - a_0$ small, and multiplying by
$\tau_j = rj + q + z$, eq. (7) becomes a linear equation in $j$. This equation holds
independently of $j$ if

$$x = \frac{-(1-\mu)lr + \sqrt{(1-\mu)^2 l^2 r^2 + 4\mu^2 (q+z)(q+z+lr)}}{2\mu(q+z+lr)}, \qquad
\alpha = \frac{\mu(1-x^2)}{lrx}. \qquad (8)$$

Using the relations (2) we get

$$x = \frac{(1-\mu)(\tilde a_1 - \tilde a_0) + \sqrt{(1-\mu)^2(\tilde a_1 - \tilde a_0)^2 + 4\mu^2\tilde a_0\tilde a_1(1+K\tilde a_0)(1+K\tilde a_1)}}{2\mu\tilde a_0(1+K\tilde a_1)},$$

where $\tilde a_i = a_i/a_2$, $i = 0, 1$.
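The algebra leading from eq. (8) to the closed form for $x$ can be checked numerically. The sketch below is only a consistency check under assumed parameter values (chosen arbitrarily, with $a_0 + a_1 + a_2 = 1$ and $a_1 \neq a_0$); the function names are hypothetical. It evaluates $x$ both from $r$, $q$, $z$ and from the reduced abundances $\tilde a_i$, and compares the two expressions for $\alpha$ that follow from the linear and the constant term of the equation in $j$.

```python
from math import isclose, sqrt


def asymptotic_state(mu, a0, a1, a2, l, L):
    """x and alpha from eq. (8), using r, q and z as defined in eq. (2).
    Requires a0 != a1 (otherwise r = 0 and alpha must be taken as a limit)."""
    r = (a0 - a1) / (a0 * a1)
    qz = l / a0 + (L - l) / a2              # q + z
    lr = l * r
    disc = (1 - mu) ** 2 * lr ** 2 + 4 * mu ** 2 * qz * (qz + lr)
    x = (-(1 - mu) * lr + sqrt(disc)) / (2 * mu * (qz + lr))
    alpha = mu * (1 - x * x) / (lr * x)     # from the term linear in j
    return x, alpha


def x_closed_form(mu, a0, a1, a2, l, L):
    """x written in terms of K = (L - l)/l and the reduced abundances a_i/a_2."""
    K = (L - l) / l
    t0, t1 = a0 / a2, a1 / a2
    d = t1 - t0
    disc = (1 - mu) ** 2 * d ** 2 + 4 * mu ** 2 * t0 * t1 * (1 + K * t0) * (1 + K * t1)
    return ((1 - mu) * d + sqrt(disc)) / (2 * mu * t0 * (1 + K * t1))


# Assumed illustrative parameters.
params = dict(mu=0.05, a0=0.2, a1=0.3, a2=0.5, l=30, L=300)
x, alpha = asymptotic_state(**params)
x_alt = x_closed_form(**params)

# Second expression for alpha, from the j-independent term of the equation.
qz = params["l"] / params["a0"] + (params["L"] - params["l"]) / params["a2"]
alpha_alt = (1 - params["mu"] + params["mu"] * x) / qz

print(x, x_alt, alpha, alpha_alt)
```

Both pairs of values should coincide up to rounding, and $x > 1$ whenever $a_1 > a_0$, i.e. the strains richer in the better-served codon are favoured.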

Note that $\alpha = \bar\nu$. Numerical simulations of eq. (6) agree very well with this solution, and
show that it is the only stable solution for almost all initial distributions.

The asymptotic form of $M_j = g_j m_j$ for $K = 0$, $l \to \infty$ and $1 \ll j \ll l$ is

$$M_j = M_0 \frac{2^l\sqrt{2}}{\sqrt{\pi l}} \exp\left(\alpha t - \frac{(2j-l)^2}{2l} - \beta\tau_j\right).$$

It is remarkable that although the distribution of $p_j$ is always a growing function of $j$ (for
$a_1 > a_0$), the distribution $M_j$ is bell shaped, due to the contribution of the multiplicity
factor $g_j$.
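The behaviour described above can be explored with a direct numerical integration of eq. (6). The following is a minimal sketch, assuming $K = 0$ (so that $\tau_j = j/a_1 + (l-j)/a_0$), an explicit Euler scheme with renormalization at every step, and arbitrarily chosen parameters ($l = 10$, $a_0 = 0.4$, $a_1 = 0.6$, $\mu = 0.2$); it is not the authors' simulation code, and the function name is hypothetical.

```python
from math import comb


def stationary_distribution(l, a0, a1, mu, steps=20000, dt=1.0):
    """Integrate eq. (6) for K = 0 until the per-strain probabilities p_j
    settle; returns (p, g) with g_j the number of strains in group j."""
    tau = [j / a1 + (l - j) / a0 for j in range(l + 1)]
    nu = [1.0 / t for t in tau]                      # duplication rates
    g = [comb(l, j) for j in range(l + 1)]           # multiplicities
    p = [1.0 / 2 ** l] * (l + 1)                     # start from a flat p_j
    for _ in range(steps):
        nu_bar = sum(gk * nk * pk for gk, nk, pk in zip(g, nu, p))
        new_p = []
        for j in range(l + 1):
            dp = ((1 - mu) * nu[j] - nu_bar) * p[j]
            if j > 0:                                # mutations coming from group j-1
                dp += (j / l) * mu * nu[j - 1] * p[j - 1]
            if j < l:                                # mutations coming from group j+1
                dp += ((l - j) / l) * mu * nu[j + 1] * p[j + 1]
            new_p.append(p[j] + dt * dp)
        norm = sum(gk * pk for gk, pk in zip(g, new_p))
        p = [pk / norm for pk in new_p]              # keep sum_j g_j p_j = 1
    return p, g


p, g = stationary_distribution(l=10, a0=0.4, a1=0.6, mu=0.2)
M = [gj * pj for gj, pj in zip(g, p)]                # mass fraction per group
mode = max(range(len(M)), key=M.__getitem__)         # most populated group
ratios = [p[j + 1] / p[j] for j in range(len(p) - 1)]
print(mode, ratios)
```

With these assumed values the ratios $p_{j+1}/p_j$ can be compared with the factor $x$ of the analytical solution; they are expected to agree only approximately, since the analytical result neglects corrections of order $r/\tau_j$.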

The most probable group $j_{max}$, corresponding to the maximum of $M_j$, is, using the Stirling
approximation for $g_j$,

$$j_{max} = \frac{lx}{x+1}.$$

For $L \gg l$ ($K \to \infty$), $x = 1$, $T \to \infty$, $\alpha \to 0$ and $j_{max} = l/2$. The
distribution $p_j$ is flat because in this case the selection has no effect. This is due to the
effective neutrality of mutations inside a group $j$.

Let us consider in the following the opposite case $L = l$ ($K = 0$), that represents a
bacterium in which an essential protein is totally composed of the single amino acid coded
by the 0 and 1 codons. This rather unrealistic case makes our model equivalent to a 1D kinetic
Ising model (Kawasaki, 1972) in an external field without interactions among spins. In this
case we can analyze in more detail the influence of selection.

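We can express $\alpha$ and $\beta$ in terms of $\epsilon = a_1 - a_0$:

$$x = \frac{(1-\mu)\epsilon + \sqrt{(1-\mu)^2\epsilon^2 + \mu^2(1-\epsilon^2)}}{\mu(1-\epsilon)},$$

$$\alpha = \frac{\mu(x^2-1)(1-\epsilon^2)}{4\epsilon lx},$$

$$\beta = \frac{1-\epsilon^2}{4\epsilon}\ln(x),$$

with $0 < \epsilon < 1$.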

For $\epsilon \to 0$ ($a_1 \simeq a_0$, $r \to 0$) we have $T = \beta^{-1} = 4\mu + O(\epsilon^2)$ and
$\alpha = 1/(2l)$. This relationship confirms the intuitive correspondence between temperature
and mutation rate, at least for moderate differences in the tRNA abundances. For $\mu \to 0$,
we get $j_{max} = l$, corresponding to a distribution dominated by the fitter strains. Increasing
the mutation rate, $j_{max}$ decreases; extrapolating to an infinite value of $\mu$ (beyond the
validity of our model), we get $j_{max} = l/2$, and the influence of selection is cancelled by
mutations.

For $\epsilon \to 1$ ($a_1 \gg a_0$, $r \to -\infty$), $T \to \infty$ and $\alpha = (1-\mu)/l$. In this
limit only the fittest group $j = j_{max} = l$ survives. In this case all mutations are
deleterious, and their effect is to lower the growth rate $\alpha$ of the total mass of the
bacterial population.
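The two limits can be checked numerically from the $K = 0$ expressions for $x$, $\alpha$ and $\beta$. The sketch below assumes illustrative values $\mu = 0.1$ and $l = 50$ and a hypothetical function name; it evaluates the formulas very close to $\epsilon = 0$ and $\epsilon = 1$.

```python
from math import log, sqrt


def k0_state(eps, mu, l):
    """x, alpha and beta for K = 0 as functions of eps = a1 - a0 (0 < eps < 1)."""
    x = ((1 - mu) * eps
         + sqrt((1 - mu) ** 2 * eps ** 2 + mu ** 2 * (1 - eps ** 2))) / (mu * (1 - eps))
    alpha = mu * (x * x - 1) * (1 - eps ** 2) / (4 * eps * l * x)
    beta = (1 - eps ** 2) * log(x) / (4 * eps)
    return x, alpha, beta


mu, l = 0.1, 50                                  # assumed illustrative values

# Nearly equal abundances: T = 1/beta should approach 4*mu, alpha 1/(2l).
_, alpha_small, beta_small = k0_state(1e-3, mu, l)
T_small = 1.0 / beta_small

# Very unequal abundances: alpha should approach (1 - mu)/l.
x_large, alpha_large, _ = k0_state(1 - 1e-6, mu, l)

# Most probable group j_max = l*x/(x + 1) in the second limit.
j_max_large = l * x_large / (x_large + 1)

print(T_small, alpha_small, alpha_large, j_max_large)
```

The printed values can be compared with $4\mu$, $1/(2l)$, $(1-\mu)/l$ and $l$ respectively.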

At this stage of development, our model is too simplified to allow quantitative
comparison with experimental data. However, it can be used as an interpretative tool for the
understanding of the mutation/selection dynamics.

III. PERTINENT DNA SEQUENCES ANALYSIS AND DISCUSSION 

In order to estimate the degree of optimization between codons and tRNA species we 
have analyzed 1530 Escherichia coli coding regions. 

The choice of analyzing mainly the E. coli coding sequences depends not only on the
large amount of data in the literature and the large number of genes that have been sequenced
(about 35% of the genome, at this date) but also on the fact that in the E. coli genome the
effect of a directional mutational pressure towards an increase in G+C or A+T content is very
small (about 51% G+C content), as compared with other organisms like Micrococcus luteus (74%
G+C content) or Mycoplasma capricolum (25% G+C content) (Sueoka, 1993) (Osawa &
Jukes, 1988b). This pressure, probably due to mutations in the DNA polymerase system,
acts particularly in shrinking the codon and anticodon sets: in E. coli there are 75 tRNA
genes and 45 types of anticodons, while in Micrococcus luteus or Mycoplasma capricolum
these numbers are much lower (29 and 33 tRNA genes, resp.). Besides, the large effects of
the directional mutational pressure on genome composition probably result in weakening
other existing functional constraints.

For each gene we have considered the simple relationship (1) between the numbers of codons of
type $i$ ($c_i$) and tRNA molecule abundances ($p_i$). We have considered equal translational
efficiency for all codons.

The effective tRNA species abundances at different duplication rates are not exactly
known; small differences in the total tRNA abundances have been found experimentally
among different strains of E. coli (Jakubowski, 1984), but the relative abundances do not
seem to vary largely even among different species of enterobacteria: the data by Ikemura
show that there is a good correspondence between the relative abundances of the tRNA
species in E. coli and Salmonella typhimurium. Thus we have approached the problem
performing the calculations in two ways:

a) considering the data on the relative abundances of the tRNA species published
by Ikemura ($\tau_I$) (Ikemura, 1981a) (Ikemura, 1981b) (Ikemura, 1985) and Jakubowski
(Jakubowski, 1984).

b) using the data on tRNA gene dosage published by Komine ($\tau_K$) (Komine et al., 1990).

These two implementations are affected by different approximations.

In case (a) the data determined by Ikemura are not exhaustive. We have approximated
the abundances of the minor tRNAs at the value 0.1 with respect to the abundance of
Leu-tRNA normalized at 1. In case (b) we have considered the abundance of tRNA species to
be just proportional to the number of tRNA genes; this assumption may not be completely
true because different tRNA gene clusters are regulated by different promoters and because
of differences in aminoacyl-tRNA synthetase abundances and catalytic activities.

In Fig. 1 we report the average time spent per codon $\tau_I$ for 1530 coding sequences, both
for the correct reading frame and for the +1 and +2 reading frames. The sequences are
ordered according to the value of $\tau_I$ of the correct reading frame.
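The comparison between reading frames can be sketched as follows. This is a minimal illustration, assuming a hypothetical lookup table `p` that maps each codon to the summed relative abundance of the tRNAs reading it; the numerical values and the function name are invented for the example, and codons missing from the table receive the default value 0.1, in the spirit of the approximation adopted for the minor tRNAs in case (a).

```python
def frame_times(seq, p, default=0.1):
    """Average time per codon for the correct frame and the +1 and +2 frames.
    `p` maps a codon to the abundance of its cognate tRNAs (assumed data)."""
    times = []
    for shift in (0, 1, 2):
        # Complete triplets only, starting at the given offset.
        codons = [seq[i:i + 3] for i in range(shift, len(seq) - 2, 3)]
        total = sum(1.0 / p.get(codon, default) for codon in codons)
        times.append(total / len(codons))
    return times


# Toy sequence made of a single repeated codon; abundances are hypothetical.
p = {"CTG": 1.0, "TGC": 0.2, "GCT": 0.25}
seq = "CTGCTGCTGCTG"
times = frame_times(seq, p)
print(times)
```

In the toy sequence the correct frame contains only the codon with the largest assumed abundance, so its average time is the lowest of the three frames; real sequences mix many codons and the separation is less sharp.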

In Fig. 2 we report $\tau_K$ for the same sequences of Fig. 1.

The comparison of the two figures shows that there is a good agreement between the
values of $\tau_I$ and $\tau_K$, although the data obtained using the distribution of tRNA abundances
from Ikemura show a more marked separation between the correct reading frame and the
+1 and +2 frames. This indicates that there is a good correspondence between tRNA
abundances and the number of copies of the related tRNA gene, i.e. the most abundant tRNAs
are expressed by triplicated or quadruplicated genes while the less abundant tRNA species
are expressed by genes present in a single copy.

In all the analyzed coding regions, the correct reading frame always presents the lowest
crossing time, although the differences are more marked for the highly expressed sets: the
mean translation time of the correct reading frame for highly expressed coding regions (such
as ribosomal genes, see Fig. 1) is lower than that calculated for non highly expressed coding
regions, while the values for the +1 and +2 frames are equal or sometimes greater.

We have also compared our results with Sharp and Li's Codon Adaptation Index
(CAI) of each sequence. This index does not depend directly on the abundances of tRNA
but it is a measure of the bias in codon usage (Sharp & Li, 1987), defined as follows:

$$CAI = \left(\prod_{k=1}^{L} w_k\right)^{1/L},$$

where the product of $w_k = w(i_k)$ is taken over the codons of the coding region under
examination. The quantity $w(i_k)$ for a codon $i$ at position $k$ is given by the relative
frequency of the codon $i$ with respect to the most used codon for the same amino acid in a
set of highly expressed genes. Note that in principle it is possible to synthesize artificial
sequences made only of major codons with a CAI value greater than that of the most abundant
proteins. This occurs because this index neglects the influence of mutations. The same
limitation applies to our quantity $\tau$.
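As an illustration of the definition, the sketch below computes the CAI as the geometric mean of the relative adaptiveness values $w$. The reference codon counts are invented for the example (two synonymous families only) and do not come from a real set of highly expressed genes; the function names are hypothetical, and codons with zero count in the reference set are not handled.

```python
from math import exp, log


def relative_adaptiveness(reference_counts, families):
    """w(codon) = count of the codon divided by the count of the most used
    synonymous codon, both measured in the reference set."""
    w = {}
    for family in families:
        top = max(reference_counts[codon] for codon in family)
        for codon in family:
            w[codon] = reference_counts[codon] / top
    return w


def cai(codons, w):
    """Geometric mean of w over the codons of a coding region."""
    return exp(sum(log(w[codon]) for codon in codons) / len(codons))


# Hypothetical reference usage for two synonymous families (Lys and Glu).
families = [("AAA", "AAG"), ("GAA", "GAG")]
reference_counts = {"AAA": 80, "AAG": 20, "GAA": 50, "GAG": 50}
w = relative_adaptiveness(reference_counts, families)

gene = ["AAA", "AAG", "GAA", "GAG"]
gene_cai = cai(gene, w)
best_cai = cai(["AAA", "GAA", "AAA"], w)      # only major codons
print(gene_cai, best_cai)
```

A sequence built only from major codons reaches the maximal value 1, which is the situation mentioned above: the index rewards bias without accounting for the randomizing effect of mutations.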

In Fig. 3 we report the CAI values for the same sequences of Fig. 1. The agreement
with both $\tau_I$ and $\tau_K$ is rather good.

The genes that show a highly biased codon usage have high values of CAI (and low values
of $\tau_I$) and belong to the translation and transcription machinery, as for instance genes coding
for ribosomal proteins, initiation and elongation factors, heat shock or stringent response
proteins, genes involved in the core of intermediate metabolism and genes coding for the
most abundant membrane proteins.

Generally the genes whose products are known to be required for maximal growth rate
show high values of CAI. We can consider for instance the 118 IF2α and 71 IF2β (Sacerdot
et al., 1992), which operate in the initiation of the translation process and probably affect
the production of ribosomal RNA, and the product of the nusA gene, which is involved in
the transcriptional process. These three genes are clustered in the same operon, probably
for a co-regulation of translation and transcription. The numbers preceding the gene names
indicate the position of the sequences in the figures.

The DNA polymerases I and II are present in different concentrations: about 400
molecules of polymerase I per cell compared with 17-100 molecules per cell of polymerase
II (Adams et al., 1992). The CAI values of the two genes (334 polA and 7m polB) reflect the
difference: 0.403 ($\tau_I$ = 2.29) versus 0.352 ($\tau_I$ = 2.89).

Since biosynthetic operons are quite constitutively expressed, except during starvation 
conditions, the genes coding for the repressor proteins show CAI values lower than repressor 
genes in catabolic operons. 

In some cases genes coding for functional aggregates constituted by structural proteins
interacting in a precise stoichiometric ratio show values of CAI or $\tau_I$ close to this ratio. This
relation is true only for highly expressed genes; for example some ribosomal proteins, such as
1 L7 and U L20, are present as a tetramer in each ribosome (Adams et al., 1992): the CAI values
of the corresponding genes are higher than those of other ribosomal genes. The same correlation
has been found for the ATP synthase genes, which code for a large complex constituted by many
copies of different subunits: there is a good agreement between the CAI values of the genes
and the stoichiometry of the subunits in the complex. This correspondence almost vanishes
for structural proteins coded by lowly expressed genes, such as those coding for fimbriae, pili
and flagella.

Generally, the enzymes that channel a metabolic pathway do not present the same bias
in codon usage, probably because natural selection acts primarily on the homogenization of
the catalytic activity of the metabolon; however, there are also a few interesting indications
to the contrary.

The biosynthesis of membrane lipids in E. coli involves the action of at least 25 genes;
many of these are clustered in the fab operon. Although the regulation of this pathway is
very complex and not yet completely elucidated, with regard to both the total fatty acid
content and the phospholipid composition, there are some observations that indicate that
the activities of a few enzymes (acetyl-CoA:ACP transacetylase, acetyl-CoA carboxylase) are
rate limiting in vivo (Magnuson et al., 1993). We have found that the bias in codon usage
of these genes is higher than that of the genes coding for the other enzymes of the pathway.
Since growth rate depends also on membrane lipid production, the bias in the bottleneck
of the pathway probably increases the rate of production, while its effect on the other genes
is vanishingly small.

Other interesting examples regard the relatively high bias in codon usage in genes like 
24 spoT, 180 htpr, 280 uspA, 295 fis, 130 cspA and 175 ackA. 

The spoT gene encodes a ppGpp pyrophosphohydrolase and thus accounts for the
degradation of ppGpp, the major inducer of the stringent response, which accumulates during
starvation periods.

Nutritional shift-up experiments, in which large amount of nutrients are added to bacteria 
growing in a minimal medium, have revealed that there are immediate changes in the rate 
of increase of cell mass while the cell division rate continues for some time at pre-shift rate 
(Cooper, 1991). 

Thus the competition among bacteria requires a quick exit from starvation periods
and a restart of growth when conditions have changed; in this light the spoT gene (CAI =
0.59, $\tau_I$ = 1.4) could be a very important target for natural selection. Probably for the same
reason the htpr gene (CAI = 0.553, $\tau_I$ = 1.95) has a relatively highly biased codon usage:
the encoded protein coordinates the global response of the cell to heat shock events. In fact
this gene codes for a sigma-32 subunit that replaces the sigma-70 in the core of RNA
polymerase, making the new holoenzyme capable of activating the heat shock promoters.
Accordingly, other heat shock genes show a biased codon usage, as for instance 485 dnaJ and
462 groEL. The intracellular amount of the universal stress protein, coded by the gene uspA
(CAI = 0.52, $\tau_I$ = 2.19), greatly increases in the case of exhaustion of nutrients like carbon,
nitrogen, phosphate, sulfate or amino acids (Nystrom & Neidhardt, 1992).

We hypothesize that codon usage reflects not only the effects of selection for high growth
rate conditions but also selection for a very fast, and perhaps short-lived, response to
equally rapid environmental changes. A clue that agrees well with this hypothesis concerns
the fis gene (CAI = 0.53, $\tau_I$ = 2.21), whose product functions as a DNA binding protein
involved in recombination reactions and in rRNA transcription. It has been reported that
when the availability of a rich medium allows the bacterial cells to exit from the stationary
phase, the amount of the fis gene product increases from less than 100 copies to over 50000
copies before the first cell division. As the exponential growth goes on, the Fis levels decrease
considerably (Ball et al., 1992). The gene cspA (CAI = 0.808, $\tau_I$ = 1.81) codes for the major
cold shock protein (CS7.4) (Tanabe et al., 1992); when the temperature shifts from 37 °C to
10 °C the level of the protein largely increases and becomes about 13% of the total proteins
in the bacterial cell.

It is noteworthy that, since acetyl phosphate seems to be a global regulator of signal
transduction in E. coli, the gene ackA (CAI = 0.66, $\tau_I$ = 1.9), which codes for a protein
that synthesizes acetyl phosphate from ATP and acetate, also presents a high value of CAI.

There are genes with little if any difference in the values of $\tau_I$ between the correct reading
frame and the others. This is mostly true for genes in plasmids (see Fig. 1). Since plasmids
and conjugative transposons play a key role in the interchange of drug resistance and
virulence traits among bacteria, and generally in the mobilization of the genetic material, it has
been proposed that the entire pool of genetic information could be accessible to all members
of the bacterial community, considered as a single, heterogeneous organism. In this regard
the emergence of a bias in codon usage for a gene carried by a plasmid is not favoured by
natural selection if the rate of the genetic information interchange among conjugative
bacteria of different species is sufficiently high. In fact, because of the genetic flux, the plasmid
could visit a wide range of hosts and thus sample different molecular environments, and
the recombination processes between homologous genes maintain sequence homogeneity.
Furthermore, an increased expression of a gene could also be attained by increasing
the number of copies of the plasmid.

The debate between neutralists and selectionists in evolutionary and population genetics 
addresses the question of how the superposition and the interactions between the different 
constraints acting on the coding regions allow the tremendous amount of molecular genetic 
variation that natural bacterial populations exhibit.

The comparison of the standard deviation of the CAI index for thirteen sequences of the
D-glyceraldehyde-3-phosphate dehydrogenase gene (33 gap gene; average CAI = 0.830, standard
deviation 0.0025) from different strains of E. coli sequenced by Selander (Selander et al.,
1991) and ten sequences of the alkaline phosphatase gene (843 phoA gene, CAI = 0.344,
standard deviation 0.0059) sequenced by Milkman (DuBose & Hartl, 1991) shows that
genes with higher bias in codon usage present lower levels of silent polymorphism.
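
The kind of comparison described above can be sketched as follows. This is only a minimal
illustration, assuming the usual definition of CAI as the geometric mean of the relative
adaptiveness values w of the codons in a sequence (Sharp & Li, 1987) and the sample standard
deviation as the measure of spread. The weight table TOY_W, the toy alleles and the function
names cai and cai_spread are hypothetical and are not taken from the data analyzed here.

```python
import math
from statistics import mean, stdev

# Hypothetical relative adaptiveness values w (NOT real E. coli weights):
# the preferred synonymous codon has w = 1, the others w < 1.
TOY_W = {"CTG": 1.0, "CTA": 0.25, "AAA": 1.0, "AAG": 0.25}


def cai(seq, w):
    """CAI of a coding sequence: geometric mean of w over its codons."""
    usable = len(seq) - len(seq) % 3          # drop an incomplete trailing codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # Codons without a weight (e.g. Met, Trp, stop) are skipped.
    logs = [math.log(w[c]) for c in codons if c in w]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


def cai_spread(alleles, w):
    """Mean and sample standard deviation of CAI over a set of alleles."""
    values = [cai(a, w) for a in alleles]
    return mean(values), stdev(values)


# Toy alleles differing only by synonymous changes (illustrative, not real data).
biased_gene = ["CTGAAACTGAAA", "CTGAAACTGAAA", "CTGAAACTGAAG"]
unbiased_gene = ["CTAAAGCTGAAA", "CTGAAGCTAAAA", "CTAAAACTAAAG"]

for name, alleles in (("biased", biased_gene), ("unbiased", unbiased_gene)):
    m, s = cai_spread(alleles, TOY_W)
    print(f"{name}: mean CAI = {m:.3f}, sd = {s:.3f}")
```

With real data, the weights would be derived from a reference set of highly expressed genes
and the alleles would be the aligned sequences of the same gene from different strains.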

This relation suggests that the more biased the codon usage of a gene within a bacterial
population, the less neutral its synonymous mutations.

Another fundamental question is whether, and to what extent, the bias in codon usage
participates in the control of gene expression. E. coli and bacteria in general are governed by
robust and plastic regulatory mechanisms, and even if the regulatory circuitry is parsimonious
in design, present-day gene networks have a large tool-kit of regulatory mechanisms,
both at the gene-protein and at the protein-protein interaction level.

Although the basic mechanisms act at the operon level, there are global mechanisms that
allow the cell to respond to challenges, such as starvation or stress, with the coordinated and
concerted expression of a network of operons. These mechanisms rely mainly on a cascade-like
activation of autogenously regulated stimulons or on the sigma subunit variability that
makes the polymerases capable of recognizing different types of promoters.

Within the operon there are mechanisms that differentiate the expression at the single 
gene level, as for instance internal promoters, transcription termination signals, ribosome 
binding sites with different efficiency, mRNA degradation signals etc. 

Although bias in codon usage has been shown to play a role in differentiating the
expression levels of the genes within an operon, these mechanisms act sequentially, so the
effect of codon bias on the translation process probably depends largely on the optimization
of the other factors involved in gene expression.

The amounts of the products of genes with high CAI values, for instance the ribosomal
and major outer membrane genes, whose products are required in large amounts relative to
other proteins at high duplication rates, are the effective fitness bottleneck of the bacterial
duplication machinery; thus the growth maximization strategy allows a cell to duplicate
faster because it lowers the time needed for translation.

Natural selection acts much more strongly in optimizing the translation of the very
highly expressed genes than of the other genes, because further increases in the translation
efficiency of the latter yield smaller improvements in the fitness of the cell.

Considerations of this type have also been proposed and tested by Dykhuizen and
collaborators for the lac operon in E. coli (Dean et al., 1986); they have demonstrated that the
contributions to fitness of the proteins encoded by the lac operon are quite different, the
bottleneck of the lactose metabolic pathway being the flux of lactose into the cell, governed by
the concentration of the enzyme lac permease. Thus an increase in the catalytic activity of the
other enzymes coded by the lac operon (for example beta-galactosidase) results in a very
small increase of the global fitness of the pathway.

This could be a general principle in gene network architecture, i.e. the potential for
increasing the global fitness of the pathway may not be equally distributed among the genes
belonging to a gene network.

Since it is generally accepted that the weaker the selective constraint, the wider the random
genetic drift (Li & Graur, 1991), the more relaxed situation of genes whose products are not
required in large amounts allows high genetic variability, consisting mainly of synonymous
substitutions.

IV. CONCLUSIONS



We have modeled some aspects of the molecular evolution of bacterial genomes by means
of the coupling between codon distributions and tRNA abundances and the competitive
effects of mutation and selection. Our model is consistent with a thermodynamical
interpretation of this process, and gives mathematical support to the observed codon usage
distribution.

The evolution of a non-synchronized population of bacteria is schematically depicted as
alternating phases of exponential growth (feast) and selection (famine), due either to
starvation or to external causes. In bacteria, because of the limiting amounts of nutrients
and the rapid fluctuations in their availability in the natural environment, the periods of
growth are sporadic. During these sporadic periods, the number of bacterial cells tends to
increase exponentially and the competition between genotypes can become very harsh,
because a slight difference in fitness can result in a large difference in numbers and in
the establishment of the fitter genotype.
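
The amplification of small fitness differences during a growth phase can be made explicit
with a simple worked relation. It rests on a simplifying assumption that is not part of the
model above: two genotypes with numbers $N_1$ and $N_2$ grow exponentially with constant rates
$r_1$ and $r_2$ and do not otherwise interact. The symbols $N_i$, $r_i$ and $n$ are introduced
here only for illustration.

```latex
\frac{N_1(t)}{N_2(t)} = \frac{N_1(0)}{N_2(0)}\, e^{(r_1 - r_2)\, t},
\qquad
r_1 = 1.01\, r_2,\quad t = \frac{n \ln 2}{r_2}
\;\Longrightarrow\;
e^{(r_1 - r_2)\, t} = 2^{\,0.01\, n}.
```

Under these assumptions a 1% advantage in growth rate doubles the relative abundance of the
fitter genotype after $n = 100$ generations of the slower one, and multiplies it by
$2^{10} = 1024$ after $n = 1000$ generations.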

Probably the two extreme conditions of feast and famine shape a large part of the fitness
function of the bacterial genomes.

Bias in codon usage seems to be an adaptive response not only to feast periods but also
to changing surroundings. In fact, surviving famine periods would require sophisticated
regulatory mechanisms leading to a shrewd management of resources, as for instance the
stringent response, while selection under rapidly oscillating conditions makes the cells
maximize the sharp up-shift or down-shift in the production of some specific proteins.
According to this hypothesis, bias in codon usage proves to be an interesting tool to
investigate the fitness bottleneck in metabolic pathways or in gene networks such as the
stringent response.

FIG. 1. Mean translation time $\tau_I$ for 1530 Escherichia coli coding regions calculated using
data on tRNA abundances published by Ikemura. The sequences are ordered according to the values
of $\tau_I$. The lower line shows $\tau_I$ for the correct reading frame. The upper lines show the
value of $\tau_I$ for the +1 and +2 reading frames. The open squares correspond to ribosomal
genes, the filled triangles correspond to genes carried by plasmids.

FIG. 2. Mean translation time $\tau_K$ calculated using the data on tRNA gene dosage by Komine.
The sequences are ordered as in Fig. 1. The lower line shows $\tau_K$ for the correct reading
frame. The upper lines show the value of $\tau_K$ for the +1 and +2 reading frames.

FIG. 3. CAI values of the coding sequences ordered as in Fig. 1. The upper line shows
CAI values for the correct reading frame. The lower lines show CAI values for the +1 and +2
reading frames.






